Date Objects

import datetime as dt
import pandas as pd
import dateutil as du

Though economic and financial time series data usually come neatly labeled with timestamps and require little extra manipulation, we will still cover the basics of the datetime library in this chapter. It will be useful when working with commercial time series data rather than economic or financial ones.

All necessary libraries have been imported at the top of the file. First we create a date object for Christmas 2021.

christmas_2021 = dt.date(2021, 12, 25)

christmas_2021 is not a string; it is a date object.

print(christmas_2021)
type(christmas_2021)
2021-12-25
datetime.date

We can check what methods and attributes it has; we filter out the magic (dunder) methods with a loop.

def filter_dunder(any_obj):
    """Return a DataFrame of an object's non-dunder methods and attributes."""
    temp_list = dir(any_obj)
    date_obj_meth_attr = []
    for i in temp_list:
        if i[0:2] != "__":  # skip magic (dunder) names
            date_obj_meth_attr.append(i)
    date_obj_meth_attr = {"meth_attr": date_obj_meth_attr}
    return pd.DataFrame(date_obj_meth_attr)
date_obj_meth_attr = filter_dunder(christmas_2021)
date_obj_meth_attr
meth_attr
0 ctime
1 day
2 fromisocalendar
3 fromisoformat
4 fromordinal
5 fromtimestamp
6 isocalendar
7 isoformat
8 isoweekday
9 max
10 min
11 month
12 replace
13 resolution
14 strftime
15 timetuple
16 today
17 toordinal
18 weekday
19 year

Obviously year, month, day, etc. are attributes.

print(christmas_2021.year)
print(christmas_2021.month)
print(christmas_2021.day)
2021
12
25

However, weekday() is a method, which returns the day of the week, where Monday \(=0\).

christmas_2021.weekday()
5
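For comparison, isoweekday() follows the ISO convention where Monday \(=1\) and Sunday \(=7\), so Saturday gives \(6\).

christmas_2021.isoweekday()
6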

isocalendar() returns the ISO year, the week number, and the day of the week.

christmas_2021.isocalendar()
datetime.IsoCalendarDate(year=2021, week=51, weekday=6)

Subtracting Days

Create another date object for 2021-11-11, which is Polish Independence Day (PID), then calculate how many days lie between PID and Christmas.

inde_day_2021 = dt.date(2021, 11, 11)  # Polish Independence day
date_delta = christmas_2021 - inde_day_2021
print(date_delta)
type(date_delta)
44 days, 0:00:00
datetime.timedelta

Note that the result is a timedelta object; it can be created manually or produced automatically by subtracting one date object from another.

delta_obj_meth_attr = filter_dunder(date_delta)
delta_obj_meth_attr
meth_attr
0 days
1 max
2 microseconds
3 min
4 resolution
5 seconds
6 total_seconds

Check how many seconds there are between the two dates.

date_delta.total_seconds()
3801600.0

Or days.

date_delta.days
44

A timedelta can also be specified by hand. Here we compute the date one week before PID.

offset = dt.timedelta(weeks=1)
inde_day_2021 - offset
datetime.date(2021, 11, 4)
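A timedelta can combine several components at once, e.g. days and hours; a quick sketch printing such an offset:

offset = dt.timedelta(days=2, hours=3)
print(offset)
2 days, 3:00:00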

Datetime Objects

A datetime object grants us a detailed timestamp down to the second. For instance, we create a datetime object just before midnight on Christmas, then convert it to a string with the .isoformat() method.

christmas_2021_midnight = dt.datetime(2021, 12, 25, 23, 59, 59)
christmas_2021_midnight.isoformat()
'2021-12-25T23:59:59'
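A datetime can also be assembled from an existing date plus a time of day with datetime.combine(); a minimal sketch reusing christmas_2021:

dt.datetime.combine(christmas_2021, dt.time(23, 59, 59))
datetime.datetime(2021, 12, 25, 23, 59, 59)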

datetime is more flexible and has many more methods and attributes.

filter_dunder(christmas_2021_midnight)
meth_attr
0 astimezone
1 combine
2 ctime
3 date
4 day
5 dst
6 fold
7 fromisocalendar
8 fromisoformat
9 fromordinal
10 fromtimestamp
11 hour
12 isocalendar
13 isoformat
14 isoweekday
15 max
16 microsecond
17 min
18 minute
19 month
20 now
21 replace
22 resolution
23 second
24 strftime
25 strptime
26 time
27 timestamp
28 timetuple
29 timetz
30 today
31 toordinal
32 tzinfo
33 tzname
34 utcfromtimestamp
35 utcnow
36 utcoffset
37 utctimetuple
38 weekday
39 year

.replace() returns a copy with the chosen attributes modified, which we can assign to a new object.

christmas_eve_2022_noon = christmas_2021_midnight.replace(
    year=2022, month=12, day=24, hour=11, minute=59, second=59
)
print(christmas_eve_2022_noon)
2022-12-24 11:59:59
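Note that .replace() does not modify the object in place; the original datetime is unchanged.

print(christmas_2021_midnight)
2021-12-25 23:59:59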

Here are the most common attributes of a datetime object.

print(christmas_2021_midnight.year)
print(christmas_2021_midnight.month)
print(christmas_2021_midnight.day)
print(christmas_2021_midnight.hour)
print(christmas_2021_midnight.minute)
print(christmas_2021_midnight.second)
2021
12
25
23
59
59

Formats

.strptime() turns strings into datetime objects (p means parse); .strftime() turns datetime objects into strings (f means format).

christmas_2021.isoformat()  # ordinary ISO format
'2021-12-25'

Check the format codes table in the Python documentation to find the proper directives.

christmas_2021.strftime("%d-%m-%Y")
'25-12-2021'
christmas_2021.strftime("%d-%B-%Y")
'25-December-2021'

Use whatever separator you prefer; what matters is the designated symbol right after %.

christmas_2021.strftime("%Y  ~*~+#  %m  ~*~#&*  %d")
'2021  ~*~+#  12  ~*~#&*  25'
christmas_2021.strftime("%Y, the number of day %j")
'2021, the number of day 359'
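The reverse direction works with strptime(). A small sketch parsing one of the strings above back into a date object (month names assume the default English locale):

dt.datetime.strptime("25-December-2021", "%d-%B-%Y").date()
datetime.date(2021, 12, 25)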

Load Data

Now we will try a dataset from Kaggle. It records the time, duration, and location of trips from a bike-sharing business.

df_bike = pd.read_csv("../dataset/metro-bike-share-trip-data.csv")
df_bike.head(3)
Trip ID Duration Start Time End Time Starting Station ID Starting Station Latitude Starting Station Longitude Ending Station ID Ending Station Latitude Ending Station Longitude ... Trip Route Category Passholder Type Starting Lat-Long Ending Lat-Long Neighborhood Councils (Certified) Council Districts Zip Codes LA Specific Plans Precinct Boundaries Census Tracts
0 18222186 15060 2017-01-19T17:05:00.000 2017-01-19T21:16:00.000 3031.0 34.044701 -118.252441 3000.0 NaN NaN ... One Way Walk-up NaN NaN NaN NaN NaN NaN NaN NaN
1 9518671 77160 2016-10-09T14:37:00.000 2016-10-10T12:03:00.000 NaN NaN NaN 3000.0 NaN NaN ... One Way Monthly Pass NaN NaN NaN NaN NaN NaN NaN NaN
2 20444932 86400 2017-02-18T10:15:00.000 2017-02-20T15:20:00.000 3026.0 34.063179 -118.245880 3000.0 NaN NaN ... One Way Walk-up NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 22 columns

Take a look at the data type of Start Time and End Time.

type(df_bike["Start Time"][0])
str

They are strings; Python won't interpret them as dates, so they must be converted to datetime objects before any further analysis.

Parsing Strings As Datetime Objects

By converting, we actually mean parsing. First we define a format string that exactly matches the format of the strings in the dataframe.

Then try an example by picking the first element of Start Time.

fmt = "%Y-%m-%dT%H:%M:%S.%f"
s = df_bike["Start Time"][0]

Parse it into datetime object.

d = dt.datetime.strptime(s, fmt)
d
datetime.datetime(2017, 1, 19, 17, 5)

As you can see, we have parsed the string into a datetime object. To parse the whole column, the easier way is to apply a lambda function to the whole column of strings.

df_bike["StartTime_Parsed"] = df_bike["Start Time"].apply(
    lambda x: dt.datetime.strptime(x, fmt)
)
df_bike["EndTime_Parsed"] = df_bike["End Time"].apply(
    lambda x: dt.datetime.strptime(x, fmt)
)

The newly added columns are at the far right.

df_bike.head(3)
Trip ID Duration Start Time End Time Starting Station ID Starting Station Latitude Starting Station Longitude Ending Station ID Ending Station Latitude Ending Station Longitude ... Starting Lat-Long Ending Lat-Long Neighborhood Councils (Certified) Council Districts Zip Codes LA Specific Plans Precinct Boundaries Census Tracts StartTime_Parsed EndTime_Parsed
0 18222186 15060 2017-01-19T17:05:00.000 2017-01-19T21:16:00.000 3031.0 34.044701 -118.252441 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 2017-01-19 17:05:00 2017-01-19 21:16:00
1 9518671 77160 2016-10-09T14:37:00.000 2016-10-10T12:03:00.000 NaN NaN NaN 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 2016-10-09 14:37:00 2016-10-10 12:03:00
2 20444932 86400 2017-02-18T10:15:00.000 2017-02-20T15:20:00.000 3026.0 34.063179 -118.245880 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 2017-02-18 10:15:00 2017-02-20 15:20:00

3 rows × 24 columns

Duration can be conveniently computed by subtraction.

df_bike["duration"] = df_bike["EndTime_Parsed"] - df_bike["StartTime_Parsed"]
df_bike.head()
Trip ID Duration Start Time End Time Starting Station ID Starting Station Latitude Starting Station Longitude Ending Station ID Ending Station Latitude Ending Station Longitude ... Ending Lat-Long Neighborhood Councils (Certified) Council Districts Zip Codes LA Specific Plans Precinct Boundaries Census Tracts StartTime_Parsed EndTime_Parsed duration
0 18222186 15060 2017-01-19T17:05:00.000 2017-01-19T21:16:00.000 3031.0 34.044701 -118.252441 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-01-19 17:05:00 2017-01-19 21:16:00 0 days 04:11:00
1 9518671 77160 2016-10-09T14:37:00.000 2016-10-10T12:03:00.000 NaN NaN NaN 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2016-10-09 14:37:00 2016-10-10 12:03:00 0 days 21:26:00
2 20444932 86400 2017-02-18T10:15:00.000 2017-02-20T15:20:00.000 3026.0 34.063179 -118.245880 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-02-18 10:15:00 2017-02-20 15:20:00 2 days 05:05:00
3 20905031 18840 2017-02-27T12:26:00.000 2017-02-27T17:40:00.000 3023.0 34.050911 -118.240967 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-02-27 12:26:00 2017-02-27 17:40:00 0 days 05:14:00
4 21031476 86400 2017-02-27T20:26:00.000 2017-03-01T09:49:00.000 3008.0 34.046612 -118.262733 3000.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2017-02-27 20:26:00 2017-03-01 09:49:00 1 days 13:23:00

5 rows × 25 columns

print("Minimum duration: " + str(min(df_bike["duration"])))
print("Maximum duration: " + str(max(df_bike["duration"])))
Minimum duration: 0 days 00:01:00
Maximum duration: 5 days 16:28:00

To show the total seconds of each duration.

df_bike["duration"].dt.total_seconds()
0          15060.0
1          77160.0
2         191100.0
3          18840.0
4         134580.0
            ...   
132422      2520.0
132423      1080.0
132424      1500.0
132425       780.0
132426       300.0
Name: duration, Length: 132427, dtype: float64

Compute the mean duration.

df_bike["duration"].mean()
Timedelta('0 days 00:28:13.423848610')
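The mean is itself a Timedelta, so it can be converted to a plain number of seconds with total_seconds() when that is easier to compare:

df_bike["duration"].mean().total_seconds()
1693.42384861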

Unix Timestamp

The Unix timestamp is the number of seconds elapsed since January 1st, 1970 at UTC, a reference point known as the Unix epoch.

Take a look at the example below to grasp the meaning of a Unix timestamp.

timestamps = [0, 1, 2, 3000000, 100000000, 1600000000, 2073658356]

for ts in timestamps:
    print(dt.datetime.fromtimestamp(ts))
1970-01-01 00:00:00
1970-01-01 00:00:01
1970-01-01 00:00:02
1970-02-04 17:20:00
1973-03-03 09:46:40
2020-09-13 12:26:40
2035-09-17 16:12:36
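The reverse conversion is done with the .timestamp() method. A sketch using a UTC-aware datetime, so the result does not depend on the machine's local time zone:

moment = dt.datetime(2020, 9, 13, 12, 26, 40, tzinfo=dt.timezone.utc)  # aware, so no local-time dependence
moment.timestamp()
1600000000.0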

Time Zone And UTC Offset

UTC stands for Coordinated Universal Time, the reference used to regulate the world's clocks. UK winter time is UTC+0; the difference between a local time and UTC is called the UTC offset. For instance, the UTC offset of Poland in winter is +01:00, which means Poland is one hour ahead of the UK in the winter.

You can specify the time zone info for a datetime object; here we use UTC.

national_day_event_UK_2022 = dt.datetime(
    2022, 4, 23, 11, 14, 15, tzinfo=dt.timezone.utc
)

So the UTC offset is naturally \(+00:00\).

national_day_event_UK_2022.isoformat()
'2022-04-23T11:14:15+00:00'

Create a timezone for Eastern European Time, UTC+2, which is the winter time in Helsinki/Tallinn. The method is to create a timedelta, then use it as the UTC offset by passing tzinfo=eet.

eet = dt.timezone(dt.timedelta(hours=2))
national_day_event_UK_2022 = dt.datetime(2022, 4, 23, 18, 14, 15, tzinfo=eet)
national_day_event_UK_2022.isoformat()
'2022-04-23T18:14:15+02:00'

If the time zone or UTC offset was a mistake, replace it with a new one. Here I just print the result rather than assigning it to a new datetime object.

beijing_time = dt.timezone(dt.timedelta(hours=8))
print(national_day_event_UK_2022.replace(tzinfo=beijing_time))
2022-04-23 18:14:15+08:00

Display the same moment in another time zone; astimezone() converts the wall-clock time while preserving the underlying instant.

national_day_event_UK_2022.astimezone(dt.timezone.utc).isoformat()
'2022-04-23T16:14:15+00:00'

IANA Timezone

If you know the location, e.g. a city name, you can use an IANA time zone name directly, such as Europe/Paris or Asia/Chungking.

We will use the dateutil library, which has been imported at the top of this file.

ck_time = du.tz.gettz("Asia/Chungking")
national_day_event_UK_2022.astimezone(ck_time).isoformat()
'2022-04-24T00:14:15+08:00'
ny_time = du.tz.gettz("America/New_York")
national_day_event_UK_2022.astimezone(ny_time).isoformat()
'2022-04-23T12:14:15-04:00'
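Since Python 3.9 the standard library also provides IANA time zones through the zoneinfo module; a minimal sketch of the same conversion, assuming the time zone database is available on your system:

from zoneinfo import ZoneInfo  # Python 3.9+; needs system tzdata (or the tzdata package)

national_day_event_UK_2022.astimezone(ZoneInfo("America/New_York")).isoformat()
'2022-04-23T12:14:15-04:00'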

Winter Time Change

Seasonal time changes are common practice in the European Union. For instance, Poland switched to winter time at 3:00 AM on 31 Oct 2021, meaning the clock was turned back to 2:00 AM when it reached 3:00 AM. We will take a look at the UTC offset before and after the change.

before = dt.datetime(2021, 10, 31, 1, 0, 0, tzinfo=du.tz.gettz("Europe/Warsaw"))
after = before + dt.timedelta(hours=6)
print(before.isoformat() + " to " + after.isoformat())
2021-10-31T01:00:00+02:00 to 2021-10-31T07:00:00+01:00

As you can see, the UTC offset changed from +02:00 to +01:00 when we added \(6\) hours across the time change. However, the change does not seem to be reflected in the addition itself: \(6\) truly elapsed hours from T01:00:00 should land on T06:00:00 local time, because of the one-hour shift backward.

We can verify how many hours Python assumes have elapsed.

(after - before).seconds / (60**2)
6.0

If we use UTC as the reference time zone, the true elapsed time turns out to be \(7\) hours.

utc_delta = after.astimezone(dt.timezone.utc) - before.astimezone(dt.timezone.utc)
utc_delta.seconds / (60**2)
7.0
after.astimezone(dt.timezone.utc)
datetime.datetime(2021, 10, 31, 6, 0, tzinfo=datetime.timezone.utc)
before.astimezone(dt.timezone.utc)
datetime.datetime(2021, 10, 30, 23, 0, tzinfo=datetime.timezone.utc)

Now use a loop to show how the UTC offset on 31 October varies across the years: in the years when 31 October is the last Sunday of October (the change day), 01:00 is still summer time (+02:00).

end_Oct = dt.datetime(2000, 10, 31, 1, 0, 0, tzinfo=du.tz.gettz("Europe/Warsaw"))
for y in range(2000, 2050):
    print(end_Oct.replace(year=y).isoformat())
2000-10-31T01:00:00+01:00
2001-10-31T01:00:00+01:00
2002-10-31T01:00:00+01:00
2003-10-31T01:00:00+01:00
2004-10-31T01:00:00+02:00
2005-10-31T01:00:00+01:00
2006-10-31T01:00:00+01:00
2007-10-31T01:00:00+01:00
2008-10-31T01:00:00+01:00
2009-10-31T01:00:00+01:00
2010-10-31T01:00:00+02:00
2011-10-31T01:00:00+01:00
2012-10-31T01:00:00+01:00
2013-10-31T01:00:00+01:00
2014-10-31T01:00:00+01:00
2015-10-31T01:00:00+01:00
2016-10-31T01:00:00+01:00
2017-10-31T01:00:00+01:00
2018-10-31T01:00:00+01:00
2019-10-31T01:00:00+01:00
2020-10-31T01:00:00+01:00
2021-10-31T01:00:00+02:00
2022-10-31T01:00:00+01:00
2023-10-31T01:00:00+01:00
2024-10-31T01:00:00+01:00
2025-10-31T01:00:00+01:00
2026-10-31T01:00:00+01:00
2027-10-31T01:00:00+02:00
2028-10-31T01:00:00+01:00
2029-10-31T01:00:00+01:00
2030-10-31T01:00:00+01:00
2031-10-31T01:00:00+01:00
2032-10-31T01:00:00+02:00
2033-10-31T01:00:00+01:00
2034-10-31T01:00:00+01:00
2035-10-31T01:00:00+01:00
2036-10-31T01:00:00+01:00
2037-10-31T01:00:00+01:00
2038-10-31T01:00:00+01:00
2039-10-31T01:00:00+01:00
2040-10-31T01:00:00+01:00
2041-10-31T01:00:00+01:00
2042-10-31T01:00:00+01:00
2043-10-31T01:00:00+01:00
2044-10-31T01:00:00+01:00
2045-10-31T01:00:00+01:00
2046-10-31T01:00:00+01:00
2047-10-31T01:00:00+01:00
2048-10-31T01:00:00+01:00
2049-10-31T01:00:00+01:00

Ambiguous Datetime

Ambiguous times happen when there is a time change: certain wall-clock times occur twice. du.tz.datetime_ambiguous() can test for this.

dt_1 = dt.datetime(2020, 10, 31, 2, 35, 10, tzinfo=du.tz.gettz("Europe/Warsaw"))
du.tz.datetime_ambiguous(dt_1)
False

At 3 AM on 31 Oct 2021, the clock in Poland was dialed back one hour to 2 AM. So any wall-clock time between 2 AM and 3 AM occurs twice; this is what we call an ambiguous time.

dt_2 = dt.datetime(2021, 10, 31, 2, 35, 10, tzinfo=du.tz.gettz("Europe/Warsaw"))
du.tz.datetime_ambiguous(dt_2)
True
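The ambiguity can be resolved with the fold attribute (PEP 495): fold=0 selects the first occurrence (still summer time), fold=1 the second one after the change. A sketch, assuming the dateutil time zone honours fold:

print(dt_2.replace(fold=0).utcoffset())  # first occurrence, +02:00
print(dt_2.replace(fold=1).utcoffset())  # second occurrence, +01:00
2:00:00
1:00:00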

Parsing Datetime With Pandas

Load data again and check the info summary.

You can see the Start Time and End Time are strings (Dtype: object).

df_bike = pd.read_csv("../dataset/metro-bike-share-trip-data.csv")
df_bike.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132427 entries, 0 to 132426
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Trip ID                            132427 non-null  int64  
 1   Duration                           132427 non-null  int64  
 2   Start Time                         132427 non-null  object 
 3   End Time                           132427 non-null  object 
 4   Starting Station ID                132408 non-null  float64
 5   Starting Station Latitude          132379 non-null  float64
 6   Starting Station Longitude         132379 non-null  float64
 7   Ending Station ID                  132331 non-null  float64
 8   Ending Station Latitude            131376 non-null  float64
 9   Ending Station Longitude           131376 non-null  float64
 10  Bike ID                            132417 non-null  float64
 11  Plan Duration                      131661 non-null  float64
 12  Trip Route Category                132427 non-null  object 
 13  Passholder Type                    132427 non-null  object 
 14  Starting Lat-Long                  98622 non-null   object 
 15  Ending Lat-Long                    131376 non-null  object 
 16  Neighborhood Councils (Certified)  98590 non-null   float64
 17  Council Districts                  98590 non-null   float64
 18  Zip Codes                          98590 non-null   float64
 19  LA Specific Plans                  15803 non-null   float64
 20  Precinct Boundaries                131328 non-null  float64
 21  Census Tracts                      131328 non-null  float64
dtypes: float64(14), int64(2), object(6)
memory usage: 22.2+ MB

We can parse the dates directly with parse_dates; the info summary then shows they are converted to datetime64.

df_bike = pd.read_csv(
    "../dataset/metro-bike-share-trip-data.csv", parse_dates=["Start Time", "End Time"]
)
df_bike.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132427 entries, 0 to 132426
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype         
---  ------                             --------------   -----         
 0   Trip ID                            132427 non-null  int64         
 1   Duration                           132427 non-null  int64         
 2   Start Time                         132427 non-null  datetime64[ns]
 3   End Time                           132427 non-null  datetime64[ns]
 4   Starting Station ID                132408 non-null  float64       
 5   Starting Station Latitude          132379 non-null  float64       
 6   Starting Station Longitude         132379 non-null  float64       
 7   Ending Station ID                  132331 non-null  float64       
 8   Ending Station Latitude            131376 non-null  float64       
 9   Ending Station Longitude           131376 non-null  float64       
 10  Bike ID                            132417 non-null  float64       
 11  Plan Duration                      131661 non-null  float64       
 12  Trip Route Category                132427 non-null  object        
 13  Passholder Type                    132427 non-null  object        
 14  Starting Lat-Long                  98622 non-null   object        
 15  Ending Lat-Long                    131376 non-null  object        
 16  Neighborhood Councils (Certified)  98590 non-null   float64       
 17  Council Districts                  98590 non-null   float64       
 18  Zip Codes                          98590 non-null   float64       
 19  LA Specific Plans                  15803 non-null   float64       
 20  Precinct Boundaries                131328 non-null  float64       
 21  Census Tracts                      131328 non-null  float64       
dtypes: datetime64[ns](2), float64(14), int64(2), object(4)
memory usage: 22.2+ MB

Another method is to use the pd.to_datetime function.

df_bike = pd.read_csv("../dataset/metro-bike-share-trip-data.csv")
df_bike["Start Time"] = pd.to_datetime(
    df_bike["Start Time"], format="%Y-%m-%dT%H:%M:%S.%f"
)
df_bike.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132427 entries, 0 to 132426
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype         
---  ------                             --------------   -----         
 0   Trip ID                            132427 non-null  int64         
 1   Duration                           132427 non-null  int64         
 2   Start Time                         132427 non-null  datetime64[ns]
 3   End Time                           132427 non-null  object        
 4   Starting Station ID                132408 non-null  float64       
 5   Starting Station Latitude          132379 non-null  float64       
 6   Starting Station Longitude         132379 non-null  float64       
 7   Ending Station ID                  132331 non-null  float64       
 8   Ending Station Latitude            131376 non-null  float64       
 9   Ending Station Longitude           131376 non-null  float64       
 10  Bike ID                            132417 non-null  float64       
 11  Plan Duration                      131661 non-null  float64       
 12  Trip Route Category                132427 non-null  object        
 13  Passholder Type                    132427 non-null  object        
 14  Starting Lat-Long                  98622 non-null   object        
 15  Ending Lat-Long                    131376 non-null  object        
 16  Neighborhood Councils (Certified)  98590 non-null   float64       
 17  Council Districts                  98590 non-null   float64       
 18  Zip Codes                          98590 non-null   float64       
 19  LA Specific Plans                  15803 non-null   float64       
 20  Precinct Boundaries                131328 non-null  float64       
 21  Census Tracts                      131328 non-null  float64       
dtypes: datetime64[ns](1), float64(14), int64(2), object(5)
memory usage: 22.2+ MB

Pandas can also set the time zone with tz_localize().

df_bike["Start Time"].head(3).dt.tz_localize("Europe/Helsinki")
0   2017-01-19 17:05:00+02:00
1   2016-10-09 14:37:00+03:00
2   2017-02-18 10:15:00+02:00
Name: Start Time, dtype: datetime64[ns, Europe/Helsinki]
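Once localized, a tz-aware column can be converted to another zone with tz_convert(); a quick sketch converting the same three rows to UTC:

df_bike["Start Time"].head(3).dt.tz_localize("Europe/Helsinki").dt.tz_convert("UTC")
0   2017-01-19 15:05:00+00:00
1   2016-10-09 11:37:00+00:00
2   2017-02-18 08:15:00+00:00
Name: Start Time, dtype: datetime64[ns, UTC]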